1 Introduction

For this project the data from the “VII Encuesta de Presupuestos Familiares” (VII Household Budget Survey) was selected. This survey is carried out every 2 to 3 years in Chile and contains data about each household’s inhabitants, some of their social characteristics (such as income, education, gender and age), and the expenses each household has within a month. The data is split into two data sets: one with the inhabitant descriptions and income variables, along with some grouping categories by household (households); the other with the expenses reported in several categories by household id (expenses).

Some data wrangling is needed before starting the exploratory data analysis: the households data set has one entry per household member, while the expenses data set contains expenses per household only, not broken down by inhabitant. This means that it is possible to merge the data by matching household ids, but not by individuals (matching by household is the intended use of the data).
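As a minimal sketch of this household-level merge, the member-level table can be collapsed to one row per household and then joined to the expenses by id. The toy data frames and the column names `hh.id`, `member`, `income` and `expense` are assumptions made for illustration, not the survey's actual variable names.

```r
# Toy stand-ins for the two data sets; all names and values are invented.
hh_toy <- data.frame(
  hh.id  = c(1, 1, 2, 2, 2, 3),         # household identifier
  member = 1,                           # helper column used to count members
  income = c(900, 0, 1200, 800, 0, 450) # one row per household member
)
exp_toy <- data.frame(
  hh.id   = c(1, 2, 3),                 # one row per household
  expense = c(700, 1500, 400)
)

# Collapse the member-level table to one row per household:
# number of members and total income.
hh_summary <- aggregate(cbind(member, income) ~ hh.id, data = hh_toy, FUN = sum)

# Join on the household id; each household appears exactly once.
merged <- merge(hh_summary, exp_toy, by = "hh.id")
merged
```

Aggregating first avoids repeating the household's expenses on every member row, which would inflate any totals computed after the merge.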

2 Loading libraries and data set

The following libraries were used for this work:

The data is stored in RData files, after being transformed from SPSS data sets.

load("households.RData")
load("expenses.RData")

Some cleaning is still needed: for example, there are two negative ages, and some households have no total income reported because of missing data.

households <- subset(households, age >= 0 & !is.na(income.hh.av.rent))
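Before filtering, it can help to count how many rows each condition affects. The sketch below does this on a small invented data frame; only the column names `age` and `income.hh.av.rent` come from the code above, and the values are made up.

```r
# Invented toy data that reuses the two column names from the filter above.
toy <- data.frame(
  age               = c(25, -1, 40, 63, -7, 12),
  income.hh.av.rent = c(1200, 900, NA, 450, 700, NA)
)

# Count the problem rows for each condition separately.
n_negative_age   <- sum(toy$age < 0, na.rm = TRUE)
n_missing_income <- sum(is.na(toy$income.hh.av.rent))

# Apply the same filter and report how many rows were dropped.
toy_clean <- subset(toy, age >= 0 & !is.na(income.hh.av.rent))
n_dropped <- nrow(toy) - nrow(toy_clean)
```

Note that `subset` also drops rows where the condition evaluates to `NA`, so a missing age would be removed as well.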

3 Univariate Analysis

Let’s start with some simple explorations of the population in our data set. From the variable descriptions we decided to focus on the following variables:

3.1 Household’s inhabitants Age Distribution

What is the population’s age distribution? The summary function can give us a start:

Statistic Value
Min. 1
1st Qu. 17
Median 32
Mean 34.9
3rd Qu. 51
Max. 103

The next figure shows a histogram of the age variable (a discrete numerical variable), with a bin width of 1 year. The figure shows that age is not normally distributed and is positively skewed overall. This is expected, since the population must decrease with age as people die from accidents, illness or natural causes.
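A histogram with one-year bins and a simple skewness check can be sketched in base R as follows. The ages here are simulated from a gamma distribution purely as a right-skewed stand-in; they are not the survey data, and the original figure may have been produced with a different plotting library.

```r
set.seed(1)
# Simulated ages (assumption): right-skewed, integer-valued, capped at 103.
age <- pmin(round(rgamma(5000, shape = 2, scale = 17)) + 1, 103)

summary(age)

# One-year bins: breaks at x.5 so that each integer age falls in its own bin.
h <- hist(age,
          breaks = seq(min(age) - 0.5, max(age) + 0.5, by = 1),
          plot = FALSE)
# plot(h, main = "Age distribution", xlab = "Age (years)")  # draws the figure

# Moment-based sample skewness; a value above 0 indicates a longer right tail.
skewness <- mean((age - mean(age))^3) / mean((age - mean(age))^2)^1.5
skewness
```

Centering the breaks on half-years avoids the ambiguity of integer ages landing exactly on a bin edge.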

However, it is interesting to notice some peaks at around 5, 25 and 50 years of age: they might correspond to generations with higher birth rates or lower infant mortality.

What about the household income distribution (in US$)?

Statistic Value
Min. 2.96
1st Qu. 653.4
Median 1110
Mean 1707
3rd Qu. 1976
Max. 53280

Let’s change the x scale so we can see more detail in the distribution of household income.
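One common way to rescale a heavily right-skewed variable such as income is a log10 transform of the x axis. The sketch below assumes that approach and uses simulated log-normal incomes rather than the survey values; the scale actually used in the original figure is not stated.

```r
set.seed(2)
# Simulated incomes (assumption): log-normal, so strongly right-skewed.
income <- rlnorm(5000, meanlog = log(1100), sdlog = 0.8)

# On the raw scale the mean is pulled above the median by the long right tail.
c(mean = mean(income), median = median(income))

# Histogram on a log10 scale: the bulk of the distribution is spread out
# and the long tail is compressed.
h_log <- hist(log10(income), breaks = 40, plot = FALSE)
# plot(h_log, main = "Household income", xlab = "log10(income in US$)")
```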

Table with the percentiles calculated to create income deciles
Percentile Value
0% 2.96
10% 419.9
20% 576.2
30% 739.1
40% 908.1
50% 1110
60% 1369
70% 1714
80% 2310
90% 3610
100% 53280
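The decile boundaries in a table like this one can be obtained with `quantile`, and `cut` then assigns each household to a decile. This is a sketch on simulated incomes; the variable names `income`, `decile_breaks` and `income_decile` are illustrative assumptions.

```r
set.seed(3)
# Simulated household incomes (assumption), not the survey values.
income <- rlnorm(1000, meanlog = log(1100), sdlog = 0.8)

# Percentiles at 0%, 10%, ..., 100% give the eleven decile boundaries.
decile_breaks <- quantile(income, probs = seq(0, 1, by = 0.1))
decile_breaks

# Assign each household to a decile (1 = lowest incomes, 10 = highest).
# include.lowest = TRUE keeps the minimum value inside the first bin.
income_decile <- cut(income, breaks = decile_breaks,
                     include.lowest = TRUE, labels = 1:10)
table(income_decile)
```

If many households shared exactly the same income, the breaks could repeat and `cut` would fail; that case would need the duplicated breaks handled first.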

So, while the mean household income is US$1707, the median is US$1110. How does it compare with other countries?

Statistic Value
Min. 1
1st Qu. 2
Median 3
Mean 3.389
3rd Qu. 4
Max. 15

Statistic Value
Min. 2.402
1st Qu. 27.24
Median 44.36
Mean 80.98
3rd Qu. 91.5
Max. 2279

4 Bivariate/Multivariate Analysis

We can rearrange this plot a little and create a population pyramid:

The peaks seem to change for each gender! We can also notice that there are more women (53.2%) than men (46.8%). Is the average age different between genders?

We see a difference in the average age between genders, with males having an overall younger population.

  Min. 1st Qu. Median Mean 3rd Qu. Max.
Men 1 16 30 33.49 50 101
Women 1 18 34 36.15 52 103
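Gender shares and per-gender age summaries of this kind could be computed along the following lines. The data frame `people` and its values are invented for illustration; only the idea of grouping age by gender comes from the text.

```r
# Invented toy data: one row per person.
people <- data.frame(
  gender = c("Men", "Women", "Women", "Men", "Women", "Men", "Women", "Women"),
  age    = c(30, 34, 52, 16, 18, 50, 70, 41)
)

# Share of each gender, in percent.
gender_share <- round(100 * prop.table(table(people$gender)), 1)
gender_share

# Six-number summary of age within each gender.
tapply(people$age, people$gender, summary)

# Mean age per gender, kept separately for later comparisons.
mean_age <- tapply(people$age, people$gender, mean)
mean_age
```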

Let’s test whether the difference is statistically significant using the Wilcoxon rank-sum test:

Wilcoxon rank sum test with continuity correction: age by gender
Test statistic P value Alternative hypothesis
143282477 2.947e-28 * * * two.sided
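A test with the same title ("age by gender", two-sided) can be run with `wilcox.test` and a formula interface. The sketch below uses simulated ages for two groups, so its statistic and p-value will not match the table above; the data frame `toy` is an assumption.

```r
set.seed(4)
# Simulated ages for two groups (assumption), rounded to whole years.
toy <- data.frame(
  gender = rep(c("Men", "Women"), each = 200),
  age    = c(round(rgamma(200, shape = 2, scale = 16)),
             round(rgamma(200, shape = 2, scale = 18)))
)

# Two-sided Wilcoxon rank-sum test of age by gender.
# With samples this large, R uses the normal approximation
# with continuity correction.
wt <- wilcox.test(age ~ gender, data = toy, alternative = "two.sided")

wt$method     # name of the test that was run
wt$statistic  # W statistic
wt$p.value    # p-value
```

The rank-sum test compares the two age distributions as a whole rather than their means directly, which suits a skewed variable such as age.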

What about the educational attainment distribution? We will study the distribution using a stacked histogram, so we can see at the same time whether there are any significant differences between genders.

We see a peak in the distribution at category 5, which corresponds to primary education. This doesn’t mean that most of the population only reached primary school: we have not removed the school-age population. We could use two variables to subset the data and include only the population that is no longer studying:

As shown in the Introduction

What are households spending on?

D Code Description
01 Food and non-alcoholic beverages
02 Alcoholic beverages, tobacco and narcotics
03 Clothing and footwear
04 Housing, water, electricity, gas and other fuels
05 Furnishings, household equipment and routine household maintenance
06 Health
07 Transport
08 Communication
09 Recreation and culture
10 Education
11 Restaurants and hotels
12 Miscellaneous goods and services